ABSTRACT

It is argued that, by design, norm-referenced tests (NRT) and
criterion-referenced tests (CRT) are conceived with different frames of
reference. They are not totally exclusive of each other, but they do
direct attention to different uses and references for information and
decision making. Their combined contributions allow a more detailed and
comprehensive means of assessing the outcomes of an educational program.
A historical perspective is given of the two types of tests, and NRTs
are discussed as to sampling and purposes. Different types of tests are
designed to sample different universes, and norm-, objective-, and
criterion-referenced tests are distinguished in aspects of design,
development, use, and interpretation. Several of the nationally normed
achievement tests may exhibit characteristics of both NRTs and CRTs to a
greater or lesser degree, according to how CRTs are defined. Criteria for
evaluating educational programs, performance objectives, and the
criteria of educational progress are discussed, as well as the
feasibility of using CRTs in large-scale or national programs.
(RC)
THE FOREST, TREES, BRANCHES AND LEAVES, REVISITED:
NORM, DOMAIN, OBJECTIVE AND CRITERION-REFERENCED ASSESSMENTS
FOR EDUCATIONAL ASSESSMENT AND EVALUATION

"It is unfair to judge what our students have learned from that standardized test that
doesn't measure the contents and emphases of our curriculum."

"You say that the students have mastered all those performance objectives. But, how
well can they perform in other situations and with other contents than the specifics of your
instruction?"

The Thesis

These comments are illustrative of the bi-polar arguments that have emerged regarding
virtues and limitations of norm-referenced and criterion-referenced tests. Which side of
the argument attracts you? No matter! The purpose of this discourse is to argue that, by
design, the NRT and CRT are conceived with different frames of reference. They are not
totally exclusive of each other, but they do direct attention at different uses and inferences
for interpretation and decision making. Moreover, we commend the notion that rather than
viewing NRT and CRT as adversaries seeking victory over each other, their combined
contributions allow a more detailed and comprehensive means of assessing and evaluating
the outcomes of an educational program.

It is imperative that the consideration of this concept of the different but mutual
contributions of CRT and NRT be based on assurance of the high quality of each. The
limitations of the NRT or CRT are easily identified if the assessments are poorly or
ambiguously constructed, administered and scored. Ill-conceived performance objectives
spawn similar CRT items. Inappropriate or defective items destroy the accuracy and value of
the NRT just as well as inadequate, biased or undefined population samples obliterate the
possible usefulness of the NRT. In the title's analogy to the forest, it is inappropriate to
consider the argument unless we begin with two healthy trees of equal quality, herein
referred to as the CRT and the NRT.

Historical Observations

Historically, a form of CRT existed long before any NRT. The questions the tutor
asked of his student in Greek or medieval civilizations were examples of the specific
contents and purposes of instruction defining the content of the examination that would
determine the student's achievement. Often the environment in which the tutoring occurred
was used to illustrate the objective of the lesson, whether it was philosophy or science. In
one region the question about the temperament of man was asked by analyzing the nearby
olive trees; in another region the question was posed by an analogy drawn from the canals
that were used to distribute the river waters to the cultivated fields. The particular
competencies and values of the teacher and the diverse demands of the regions or mode of
living in city, hamlet or rural isolation made variable definitions of what were relevant and
important knowledge, skills or attitudes to be learned. The criteria for progress of the
learner were defined and presented within the local situation. Such a procedure was
accepted and validated as meaningful and effective because the learner, after his schooling
was completed, had to cope with knowledges, skills and attitudes of the local environment.

The 20th century brought, among other things, unprecedented mobility, technology
and urbanization. The children and youth experienced different education as they moved
from region to region or from rural to urban environments. Moreover, youth was frequently
educated in one context and soon moved to cope or find vocational adjustment in a
different environment with different demands. During the 20th century, the standardized
test was born out of the pressures to organize more effectively the manpower for the
demands of World War I. The verbal components of these tests were soon attacked because
of their bias for particular contents and environments assumed for the learner. These efforts
by Army classification to devise norm-referenced common criteria were prompted by the
observations of the variability of criteria of the individual examiner's judgments. At the
same time, they demonstrated the problem of drawing inferences about the development of
individuals with different experience and educational backgrounds. However, then and now,
the critical element of the relation of educational background and accumulated learnings
was imperfectly related to the tasks the individual would be required to cope with in his
vocational and living demands.

Early in the 20th century, Pintner and other psychologists made exhaustive studies of
the comparability of the mental developments (and accumulated achievements) of persons 
in different cultures and continents. Their attempts were continuously confounded by the 
inability to devise a measure that would be culture-free or culture-fair for the diversity of 
content, context and purpose of education in the various cultures. In short, separate 
assessments were required within the various cultures, to monitor the effects of education 
and the development of persons in each culture. 

In the United States some of the early tests of achievement and mental ability were 
observed to produce different results in various regions. Particular item contents were 
singled out to demonstrate the bias of the item for or against individuals coming from 
different environments and with different educational emphasis. For example, one test item 
asked about the structure and uses of a single-tree. By the late 1930's, it was obvious that 
such content had meaning and emphasis in the education of the agrarian population, while it
was seldom experienced or discussed in the urban and suburban environments of education.
Conversely, the item that asked about the construction and uses of the escalator was readily
seen as appropriate in urban education and relatively unknown in the rural.

These limited illustrations merely identify the historic concern for content appropriate
to the purposes and context of the local or regional educational program. Insofar as the
person was educated within a local context in which he would make his life, the measures of
the outcomes could be readily designed for those specific knowledges, skills and attitudes
that would be locally valued and required. However, as mobility became a way of life,
education was concerned with helping individuals develop knowledge and skills that
provided more common currency in any region of the United States. As students moved
from one institution to another and from one region to another, there was interest in
developing measures that were general surveys of the common skills and knowledges
identified as important for coping with the inclusive culture of the country.

Norm-referenced tests were designed to survey the skills and knowledges that were
generally common to many or most educational programs. And by design, although it has
rarely been recognized, the standardized norm-referenced test has an imperfect and
incomplete congruence to any particular school program.

The construction of the national, standardized NRT was based on surveys of contents,
materials and anticipated outcomes of schools in every region. Courses of study, curriculum
guides, textbooks, instructional materials and educators' definitions were compiled and
analyzed to identify contents with the highest common incidence. Items of these nationally
standardized tests were designed as surveys of skills and knowledges generally common to
many or most educational programs.

After the test was constructed, there was the further need to determine what rate and
degree of attainment would be found in student populations throughout the country. To
obtain an answer to this question, the test publisher defined a sampling process which would
(as nearly as possible) proportionately represent the rural, suburban and urban schools in all
regions of the nation. The tests were administered to this "national sample," and the
performance summarized in a distribution of scores. The distribution of scores is then
converted into one or several normative scales to facilitate the description or
characterization of various degrees of success on the array of items in the test. The norms thus
describe the range and relative incidence of success (usually with emphasis upon average or
modal performance) of the reference population which is the particular obtained sample of
many schools in many regions.
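
The conversion of a reference distribution into a normative scale can be illustrated with a
small sketch. The scores below are invented for illustration, and the mid-rank percentile
formula is one common convention, assumed here; it is not necessarily the scale any
particular test publisher used.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(raw_score, reference_scores):
    """Percent of the reference sample scoring below raw_score,
    counting half of those who tie with it (mid-rank convention)."""
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, raw_score)           # scores strictly lower
    ties = bisect_right(ordered, raw_score) - below   # scores exactly equal
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical raw scores from a "national sample" (invented data).
national_sample = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]

if __name__ == "__main__":
    for score in (15, 22, 30):
        print(score, percentile_rank(score, national_sample))
```

The same raw score would receive a different percentile rank against a different reference
sample, which is why the norm describes relative standing and not mastery of any
particular local content.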

There are few, if any, tests that are not designed to sample a very large array of
contents. This is not a singular frailty of tests, for the individual in making an evaluation of
another person's performance is required to make a judgment from the sample of observed
behavior; and he cannot obtain observations or receive information concerning all behavior
of the individual in every situation in which he is engaged.

Assessment and evaluation are basically restricted by the adequacy, representativeness 
and relevance of the data obtained. The sample may be an inaccurate representation of the
characteristic, or it may be unrepresentative of the behavior at another time, in another
format or situation.

A substantial amount of the concern with various types of tests and other assessments 
comes from the lack of understanding of what the technique is designed to sample, as well as
the improper interpretations which are made from the data. There is a common tendency to
make precise classifications of human behavior from assessment techniques that were not
designed for such a purpose. Even with appropriate understanding of the test as a broad
survey, or as a restricted documentation of a specific act, there is still the tendency to want
to speak with precise certainty rather than with varying degrees of assurance. The basic
necessity which requires sampling also clearly requires interpretation that describes
probability and not finite certainty!

Different Types of Tests Are Designed to Sample Different Universes

To pursue the distinction of norm-, domain-, objective- and criterion-referenced
measures in all aspects of design, development, use and interpretation would require a book
of many pages. It is believed that this discourse may be shortened by using some figures to
suggest the variety of purpose and design of the several types of tests. The figures will be
restricted to the nature of the samples that are commonly used by the various techniques to
gather information about student behavior following an educational experience.

Figure 1 presents the design of a norm-referenced test (for elementary grades 4, 5 and
6) which is used to survey many student populations on those elements which are judged to
be "generally common" anticipated outcomes of education. The illustrated design also
suggests that the survey may be used for several ages and thus not precisely or exhaustively
be concerned with one age or program.

The illustration of the fourth grade reading domain (Figure 2) shows the (1)
instructional materials (content and format), (2) instructional techniques, and (3) outcome
objectives as reflected by the learning strategies and sequences of School A on the left side
of the figure. On the right side of the figure are the generic categories of the reading universe
found in consensus definitions of reading.

The lines with arrows show the typical match-mismatch of the test items of the generic 
categories with the specific contents of instruction in School A. However, it is alleged that
the curriculum and instruction in School A are designed to attain the generic goals and
objectives of the universe of reading.

The norm-referenced test samples some content from the four categories by content
and format generally representative of generic consensus of what is included in reading.

The commercially developed criterion objectives and test items are shown to assess
seven of the ten skills in the School A program, while three of the items are not included for
emphasis.

The objective-referenced test developed for grade 4 of School A provides an exact
match to the objectives defined in the local reading continuum.

The criterion-referenced test for instruction in grade 4 of School A provides exact 
replication of the content, format and application used in the daily instructional activities. 
Obviously this test measures the attainment of the precise local reading experience in 
content and sequence. 

In the schema of the 4th grade reading universe, it may be observed that the 
norm-referenced test and commercial domain-referenced objectives and items are designed 
to survey reading skills by sampling the most generic definitions of reading. The precise
content, vocabulary, skills, format or application could not have a perfect match with any
local program. On the other hand, the NRT provides an opportunity to survey the
generalized outcomes of many different programs of reading instruction. This may be
viewed as important survey information by those who exclaim, "Don't bother me with the
minutiae of how you teach; just give me evidence that students have developed the ability to
read a variety of materials they will come in contact with (beyond the materials in the daily
instruction)."

Figure 1

SCHEMA OF A NORM-REFERENCED TEST
FOR ELEMENTARY GRADES 4, 5 AND 6

[Figure: Composite Curriculum Domains (from many regions, schools and professional
sources) are sampled to yield Norms (performance of samples of students in many regions
and schools): "The ranking of performance from the range and distribution of the
performance of the reference population."]

• x's represent the samples drawn from very large composite curriculum domains.

• Samples of the domains are used to develop survey test items.

• The largest samples are drawn from the 4th, 5th and 6th grade domains with very limited samples above
and below the grades for intended survey testing.

Figure 2

SCHEMA OF A
FOURTH GRADE READING DOMAIN ASSESSMENT

[Figure panels: Criterion-Ref. Test, Grade 4, School A (exact replication of content,
format and application found in test items and daily instruction); Norm-Ref. Test
(Grades 4, 5, 6).]

*Publisher's items have no specific match to School A 4th grade reading content, format or sequence.



It may also be observed that School A's local program has been constructed to enable
students to develop the reading skills in the various generic reading categories. The dotted
arrows running from the local program on the left of the schema to the generic categories 
indicate this planned relationship. 

From the standpoint of the evaluation of the instructional program, the schema
suggests that the local criterion-referenced test constructed to exactly measure the local
curriculum and the local objective-referenced test should provide information of student
mastery of these contents. On the other hand, the local program is said to be designed to
develop student skills in the generic categories of word attack, comprehension and
application. The NRT provides a general survey of these reading skills.

Different Purposes of NRT and CRT
Multiple Purposes and Uses of Testing and Evaluation

The heated discussions of the virtues and limitations of norm-referenced and
criterion-referenced tests have generally ignored the different purposes and uses of these
techniques and have emphasized the varying success of the student population and the
congruence of the test items to the students' instructional experience. As previously stated,
the NRT is designed to survey the relative attainment of students in terms of generally
accepted skill and knowledge outcomes. The NRT is an external measure to provide
indications of the relative achievement of many populations in relation to a reference
population that is hoped to include a proportional representation of students from all types
of environments and cultures of the nation.

The CRT and objective-based tests (generally of local design and construction) are
planned to assess the local students' attainment of the precise content, format and sequence
of their instructional program experience. The primary purpose of such measurement is to
determine which specific contents and objectives have been attained and to determine the
progress the students have made on the sequential objectives of the local continuums in
reading, math, etc. It is generally alleged there is no interest (or appropriate procedure) to
determine the relative ranking of students within or outside of the local school program.
The intent is to determine mastery of locally defined performance objectives and monitor
individual student progress on local curricular continuums.

In the illustration in Figure 2, the norm-referenced standardized survey test is shown to
sample the universe of common reading skills. While domain-, objective-, and
criterion-referenced tests are given various definitions by different users, the illustration would
suggest the following distinctions. A test of a domain may sample a particular sub-part of a
larger universe. Objective-designed tests are usually developed to assess the particular
anticipated outcomes of a local or specific instructional program. The criterion-referenced
measure in this illustration is constructed to measure the mastery of the specific content,
context and format of the local instructional program.

It is recognized that the foregoing distinctions are not commonly defined or
exemplified by some recently developed tests given these measurement names. Certainly a
portion of the lack of acceptance of these various instruments as contributing to more
extensive assessment of student achievement is due to the variety of definitions and
understandings of the purposes of each.



While the aforementioned differences in purpose and design of NRT and CRT seem
basic to the issue of planning an assessment program, there are further complexities that
confound the issue. Not a small problem is the multitude of ways criterion-referenced tests
have been defined in the literature. The definitions are sufficiently different that a particular
test may be classified as a norm-referenced test by one definition and a criterion-referenced
test by another. Of even greater import is the fact that several of the nationally normed
achievement tests may exhibit characteristics of both NRT and CRT to a greater or lesser
degree, according to the definition of CRT.

Hambleton and Novick have provided a thoughtful analysis of the issues and
distinctions of NRT and CRT and conclude that it may be misleading to talk about NRT
and CRT. They suggest that the results from either type of instrument may be explained with a
norm-referenced interpretation, a criterion-referenced interpretation or both. What is needed
is precise definition of the decision-theoretic process from which the theory, purpose and
use of the measurement are derived.

Cronbach and Gleser have suggested that norm-referenced measurement is useful in
situations where one is interested in a "fixed quota" selection or ranking of individuals,
while criterion-referenced measurement would be useful for "quota-free" assessment.
However, some recent reports of the results of criterion-referenced measurement generated
per cent of students passing various items, and the users were rapidly accumulating data that
might be used as a local norm. This observation reinforces the suggestion that it is time to
have the measurement theory and the types of uses and decisions that are to be made from
the assessment data clearly defined and understood. Then it may be more appropriate to
select or develop the test that will fit the use and interpretation desired.

A substantial basis for the argument over NRT and CRT seems to be found in the
criteria that may be used to evaluate the effectiveness of educational programs. Individual
schools have loudly condemned the "norms" of the norm-referenced tests for being unfair
to their particular student population. The condemnation was both for the higher entry
status of the modal reference population and for the norms which showed that particular
school to have less than average ranking. In addition, local instructional staff became
frustrated with the small increments of growth realized by the students on the normative
scale as contrasted to local observations and assessments that were perceived as revealing
substantial progress with the local instructional contents.

Mandatory evaluations of specially funded programs and the implementation of
"accountability" procedures heightened the concern of administrators and instructional
staff. The pre-post model of testing with NRT was not producing increments of growth for
educationally retarded student populations equal to or greater than normally achieving
student populations. These results should not have been viewed with surprise, for the entry
characteristics of the retarded populations identified the lower growth increments and the
additional obstacles to academic attainment that were not present in the higher-achieving
populations. A thorough paradigm of learning would certainly raise a question with the
assumption that any learner, irrespective of entry characteristics, would have equal
opportunity for any increment of academic achievement. The accumulated past learnings
from the environment and school have been shown to be impressive determiners of
subsequent learning. Unfortunately, achieving the mean or better ranking on a
norm-referenced test was perceived by many as the only criterion of success.

The concern for relative ranking soon produced questions about the appropriateness of 
the content and format of the norm-referenced test items. Studies quickly showed that a
certain per cent of the vocabulary or content of the test items was not present in a local set
of instructional materials, and the students had never had practice in responding to the
format of the test items. The frequent reaction was to cry, "Foul! The instrument is no good
for measuring the progress of students in the local school curriculum." Needless to say, such
reactions reflected a lack of understanding of the norm-referenced test as a survey that
sampled generally accepted academic outcomes across a reference population of great
diversity.

The creation of performance objectives stating specific contents, formats, and
creditable behavior in terms of the local instructional program was viewed as a fair and exact
method of determining student progress. Criterion-referenced or objective-referenced test
items were then constructed to replicate exactly that specified in the local performance
objectives. Review of numerous compilations of these performance objectives and their
referenced test items reveals wide differences in conceptualization and technical quality.
While it may be unfair to generalize grossly, it is observed that many performance objectives
deal with extremely small, isolated elements of the skills of reading, math, coordination or
personal-social behavior, etc.

A review of the assessment data of several specially funded and innovative programs 
presents results which suggest disparate evaluations of the impact of the program on the
target student population. As an illustration, the test results from an elementary school
program for educationally retarded children were summarized over a three-year period. The
evaluation design called for beginning and end-of-year testing by both a NRT and a local
CRT. The objectives of this program were stated as 1) 80% of the targeted students will
attain mastery of 80% of the performance objectives (CRT items developed for each
objective), and 2) the targeted students will attain 1.0 or more achievement on the grade
equivalent scale of the norm-referenced test required by the funding agency.

The results of the first year showed the targeted population to have attained 80% or
more success on 73% of the objectives. At the end of the second year, 81% of the students
were reported attaining success on the objectives, and 83% attainment was shown for the
third year.
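
The first program objective (80% of students mastering 80% of the objectives) is a two-level
threshold, and a minimal sketch may make the computation concrete. The student labels
and mastery records below are invented for illustration; the paper does not report
student-level data or describe how the project staff actually tallied results.

```python
def program_objective_met(mastery_table, student_pct=80, objective_pct=80):
    """mastery_table maps each student to a list of booleans, one per
    performance objective (True = objective mastered).

    Returns True when at least student_pct percent of students have
    mastered at least objective_pct percent of the objectives.
    Integer percent comparisons avoid floating-point rounding surprises."""
    students_meeting = sum(
        1
        for results in mastery_table.values()
        if 100 * sum(results) >= objective_pct * len(results)
    )
    return 100 * students_meeting >= student_pct * len(mastery_table)


# Hypothetical records: five students, five objectives each (invented data).
records = {
    "A": [True, True, True, True, True],      # 5 of 5
    "B": [True, True, True, True, False],     # 4 of 5
    "C": [True, True, True, False, True],     # 4 of 5
    "D": [True, False, True, True, True],     # 4 of 5
    "E": [True, False, False, True, False],   # 2 of 5
}

if __name__ == "__main__":
    print(program_objective_met(records))  # four of five students qualify
```

Note that such a tally says nothing about relative standing; it only reports whether the
locally defined criterion was reached.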

During the same period in annual pre- and post-testing with the norm-referenced test,
the mean change in grade equivalents was .6, .7 and .7 in the first, second and third years,
respectively. The project staff observed that two years of growth had been attained in the
three-year period. This was the same amount of change that had been observed in the three
years prior to the project. To the project staff, the unchanging results on the
norm-referenced tests were viewed as evidence that the tests were at fault and the "true"
picture of growth was shown by the criterion items of the district's performance objectives.



Another result was discovered by an independent evaluator who made a longitudinal
analysis of the NRT results and tested a random sample of the target students after
completing three years of the program. A randomly selected group of the
criterion-referenced items was used for measuring the objectives in the 1st, 2nd and 3rd years. The
beginning of the 1st year to the end of the 3rd year NRT results were compared, and the
difference was 1.7 on the grade equivalent scale. This was three months less than the sum of
the changes observed by the pre-, post- annual testing. The CRT items also showed lower
percentage mastery than had been reported in the three separate years. Of particular interest
was the percentage of students showing mastery on the 1st, 2nd and 3rd year objectives. In
this instance, 46% of the 1st year objectives, 53% of 2nd year objectives, and 61% of the 3rd
year objectives were passed by the students in the fall semester following the completion of
three years in the project (in contrast to the 73%, 81% and 83% reported at the end of the
three years). These data suggest there was substantial forgetting even though the objectives
dealt with academic skills that were thought to be continuously utilized in the sequential
curriculum continuum.
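
The arithmetic behind these comparisons can be laid out in a few lines. The figures are
those reported in the text; the only assumption is the usual convention that one tenth of a
grade equivalent corresponds to one school month, which is how the text's "three months"
is read here. Gains are held as integer tenths to avoid floating-point rounding.

```python
# Annual pre/post NRT gains, in tenths of a grade equivalent (.6, .7, .7).
annual_gains_tenths = [6, 7, 7]
# Gain from start of year 1 to end of year 3 (1.7 grade equivalents).
longitudinal_gain_tenths = 17

sum_of_annual_gains = sum(annual_gains_tenths)                 # 2.0 grade equivalents
shortfall_months = sum_of_annual_gains - longitudinal_gain_tenths

# Percent mastery reported at the end of each year vs. at the later retest.
reported_pct = [73, 81, 83]
retested_pct = [46, 53, 61]
drop_in_points = [r - t for r, t in zip(reported_pct, retested_pct)]

if __name__ == "__main__":
    print("sum of annual gains (tenths):", sum_of_annual_gains)
    print("shortfall (months):", shortfall_months)
    print("drop in percentage points by year:", drop_in_points)
```

The retest figures fall 22 to 28 percentage points below the end-of-year reports, which is
the pattern the text attributes to forgetting.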

It is improper to draw the conclusion that nationally standardized tests must be
norm-referenced; nor should it be concluded that tests designed for the assessment of local
objectives and criteria are automatically freed from any ranking or norming use and
interpretations. The essential issue is the need for precise definition of the design, use and
interpretation for decisions that are planned for the measurement. A test item typically has
a defined response which is credited as mastery of a particular element or elements of
learning. This is true for very limited or comprehensive objectives of either national or local
derivation.

While the majority of nationally standardized tests have been associated with norm
referencing, it is quite feasible to conceptualize criterion-referenced tests in large-scale or
national programs. Such a test would have items constructed to assess explicitly defined
aspects of achievement, and the standardization of scoring would verify that the creditable
behavior met the desired criterion of mastery. The use of such a test of objective or criterion
referencing would probably be to determine whether a student or groups of students did or
did not demonstrate mastery. For the individual or a group, the assessment would describe
which criteria or objectives were mastered and which were not. Some psychometricians have
suggested that existing standardized tests may be used in a criterion-referenced manner by
identifying the items that match the desired local outcome objectives and then scoring only
those items for mastery. Such usage would offer a means of assessing the mastery of
designated objectives for a class, school or institution without any concern for norms or a
reference population.
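
The suggestion of scoring only the matched items can be sketched as follows. Everything
named here is hypothetical: the item numbers, the objective labels, the item-to-objective
mapping and the 80% mastery cutoff are illustrative assumptions, not features of any
actual standardized test.

```python
from collections import defaultdict


def mastery_by_objective(responses, item_objective, local_objectives, cutoff_pct=80):
    """Score only the items that match a local objective.

    responses        -- item id -> True if answered correctly
    item_objective   -- item id -> objective the item is judged to measure
    local_objectives -- set of objectives the local program cares about
    cutoff_pct       -- assumed mastery cutoff (percent of matched items correct)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, is_correct in responses.items():
        objective = item_objective.get(item)
        if objective in local_objectives:        # unmatched items are ignored
            total[objective] += 1
            correct[objective] += int(is_correct)
    return {obj: 100 * correct[obj] >= cutoff_pct * total[obj] for obj in total}


# Hypothetical five-item test and one student's responses (invented data).
item_objective = {1: "word attack", 2: "word attack",
                  3: "comprehension", 4: "comprehension", 5: "map reading"}
local_objectives = {"word attack", "comprehension"}
responses = {1: True, 2: True, 3: True, 4: False, 5: True}

if __name__ == "__main__":
    print(mastery_by_objective(responses, item_objective, local_objectives))
```

No norm table or reference population enters the computation; item 5 is simply left
unscored because it matches no local objective.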

It is recognized that there are many technical problems involved in using either 
norm-referenced or criterion-referenced tests for making conclusions about the true growth
of student populations. Those problems are more complex than may be adequately
addressed in this paper. Suffice it to say, the reliability and validity of the measures are
troublesome problems that plague those interested in very precise and parsimonious
conclusions concerning short-term, annual, or longitudinal growth in academic achievement.

The issue of appropriate criteria of educational attainment is raised by these and other
similar results of measurement. One cogent question relates to whether educational
programs are to be judged mainly by the student mastery of local curriculum content
immediately following an instructional sequence. Or is the effectiveness of the program to
be viewed in relation to perseverance of accumulated learnings designed in the local
program? Another cogent question is whether the purpose of many if not all local
educational programs is to assist the learner in acquiring the skills, knowledges and attitudes
he will need in coping with subsequent demands in school and society. Indeed, the issue is
whether the educational program is concerned with the learner acquiring and internalizing
the skills so that they may be applied in many formats, contexts and applications.

Many other questions may appropriately be asked about the purposes, implementation 
and outcomes of education for which measurement and evaluation techniques are desired.
Definition of the data needed for educational decisions becomes very important to give
direction to the needed measurement theory. However, pending such clarification, it
appears the criterion-referenced or objective test items designed to assess a particular
instructional sequence, as well as the norm-referenced survey of more generalizable
outcomes, may make mutual but different contributions of data for educational decisions.

It is common to observe the enthusiasm and effort that is mobilized for the production 
of a new implement or technique, even though there is inadequate definition of its purpose,
use or appropriate interpretation of the results. Explicit measurement theory for the
assessment of particular short- or long-term instructional experiences is a sorely needed road
map. Such a map would provide direction for appropriate and effective construction and use
of the norm-referenced and criterion- or objective-referenced tests.

As an epilogue, it should be mentioned that all past and present efforts to understand 
and assess human learning have measured only the tip of the iceberg visible above the water. 
A variety of techniques have been devised. At this time, new techniques and strategies are
being developed. However, implements and techniques may be more effectively designed
and used when there is a well-defined theory. Without the theory, potentially fine
instruments may be used and interpreted with negative effects.

In quest of a theory, the road will probably lead through a forest of innovations. 
However, if we are to profit from the journey in measurement and evaluation, we must
clearly identify the direction, terrain and destination as well as the artifacts that may be
acquired along the way. Hopefully, the present ambiguity of the purpose and use of the 
criterion-referenced and norm-referenced tests may be seen in the perspective of the analogy 
of defining the forest, trees, branches and leaves of the assessment and evaluation of an 
educational program. 